In this submission I improve upon a baseline of PR #800 (bpb 0.5654) by introducing Hammerstein-Wiener Neural ODEs (HWNODEs).
What is HWNODE?
HWNODE is a novel weight-shared continuous-depth replacement for MLPs that can be more parameter-efficient than MLPs at small parameter counts, and may also permit dynamic compute at larger scales. It borrows the Hammerstein-Wiener structure from control theory: a linear ODE sandwiched between two nonlinearities, which together behave like a nonlinear ODE. This is useful because the linear core admits a closed-form matrix-exponential solution, which we approximate with a low-order Taylor truncation instead of an iterative ODE solver. By repeatedly applying the same HW block across different virtual depth states, we obtain multiple distinct nonlinear layers from one shared parameter set. To keep this stable (||A|| <= 1), we apply spectral normalization.
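The structure above can be sketched in a few lines of numpy. This is an illustrative forward pass only, not the PR's actual implementation: the state dimension, activation choice, step size, and Taylor order are all assumptions made for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # state dimension (assumed)
A = rng.standard_normal((d, d)) / np.sqrt(d)

# Spectral normalization: rescale A so its largest singular value is <= 1,
# which bounds the linear flow and keeps repeated application stable.
A = A / max(1.0, np.linalg.norm(A, 2))

def taylor_expm(A, dt, order=2):
    """Low-order Taylor truncation of exp(A*dt): I + A*dt + (A*dt)^2/2! + ..."""
    M = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, order + 1):
        term = term @ (A * dt) / k
        M = M + term
    return M

def hw_block(x, A, dt=1.0, order=2):
    """Hammerstein-Wiener block: nonlinearity -> linear ODE flow -> nonlinearity."""
    h = np.tanh(x)                      # Hammerstein (input) nonlinearity
    h = taylor_expm(A, dt, order) @ h   # linear ODE core, solved in closed form
    return np.tanh(h)                   # Wiener (output) nonlinearity

x = rng.standard_normal(d)
y = x
for _ in range(4):                      # virtual depth: reuse the same block 4 times
    y = hw_block(y, A)
```

Note that the loop at the end reuses the single `(A, dt)` parameter set for every virtual layer; only the state changes between steps.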
How did I get here?
Looking at the state of the parameter golf records within the first few days, the largest improvements seemed to come from managing to add another layer without significantly reducing width. This matches the common wisdom that depth beats width, especially at this scale. I also did not believe the MLPs were maximally parameter-efficient, since you can quantize the weights to half the size without breaking them or losing nearly that much information or reasoning ability. Given this, I began looking for a way to loop through the same weights while avoiding credit-assignment problems or performance issues.

This is a hard problem, but there were two main ways to think about it. First, you can treat it as an equilibrium problem, leading to DEQs (deep equilibrium models), which apply one layer repeatedly until convergence. Because of that repetition, DEQs are too slow to train and run for this challenge. Alternatively, you can learn a differential equation, which allows modeling complex dynamics with respect to another variable. This is a neural ODE (Ordinary Differential Equation), and here we model the dynamics with respect to depth. To make this practical, we approximate exp(AΔt) with a Taylor polynomial and place nonlinear maps around the linear ODE core at each shared depth step.

While this lets us effectively model depth and generate virtual layers, it has two remaining issues. First, and most glaring, speed: a generic nonlinear ODE would require iterative numerical integration, which is too expensive for this setting. To resolve this, we take a page out of control theory (Hammerstein-Wiener models) and make the ODE linear (thus efficiently solvable) while wrapping it in nonlinearities. Second, instability: if the neural ODE diverges, the gradients can explode or behave unpredictably. This is easy to resolve with spectral normalization, which keeps ||A|| <= 1, bounding the operator norm of the linear dynamics and greatly improving stability.
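The claim that a low-order Taylor truncation suffices for the linear core can be checked numerically. The sketch below (with an assumed step size and a spectrally normalized matrix, as described above) compares an order-2 truncation against a high-order series used as a near-exact reference for exp(AΔt).

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
A = rng.standard_normal((d, d))
A = A / np.linalg.norm(A, 2)          # spectral norm exactly 1, as in the stability fix

def taylor_expm(M, order):
    """Taylor series for exp(M) truncated at the given order."""
    out = np.eye(d)
    term = np.eye(d)
    for k in range(1, order + 1):
        term = term @ M / k
        out = out + term
    return out

dt = 0.25                              # assumed step size; smaller dt => smaller error
exact = taylor_expm(A * dt, 30)        # order-30 series as a near-exact reference
approx = taylor_expm(A * dt, 2)        # the cheap order-2 truncation used per step
err = np.linalg.norm(exact - approx, 2)
```

Because ||A·Δt|| <= 0.25 here, the truncation error is bounded by the series tail (roughly 0.25³/3! ≈ 0.003), which is why the closed-form step can replace an iterative solver.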
This architecture generates virtual depth by reusing the same Hammerstein-Wiener block across repeated shared depth steps, while the Taylor expansion approximates the linear flow inside each step. This results in a performant and theoretically unbounded parameter-sharing architecture.
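The parameter-sharing benefit can be made concrete with a back-of-the-envelope count. The widths and depth below are assumptions for illustration (bias terms included), not the PR's exact configurations.

```python
# One shared HW block reused for k virtual layers vs. k distinct dense layers.
d = 384                        # hidden width (assumed)
k = 6                          # virtual depth: number of shared steps (assumed)

hw_params = d * d + d          # one shared A matrix plus bias, reused k times
mlp_params = k * (d * d + d)   # k distinct dense layers of the same width

print(hw_params, mlp_params)
```

At any fixed width, the shared block spends 1/k of the stacked MLP's parameters on its depth dimension, which is the sense in which virtual depth is "theoretically unbounded": k can grow without adding parameters.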
Data
When testing HWNODE against an MLP baseline, I performed both LM testing via parameter golf and RL experiments. In RL, HWNODE looks compelling as a parameter-efficient policy network, because we can create virtual depth at the same parameter count, which at these scales matters far more than width (and we do not sacrifice width either). As a language model, however, HWNODE is unable to beat the strongest MLPs at the same scale. HWNODE degrades much less aggressively under quantization, though, so it wins once models are compressed.
Reinforcement Learning: LunarLander-v3 (PPO, 500K steps, 3 seeds)
This experiment measures the final mean reward over the last 100 episodes of training; the task counts as solved above a score of 200. The narrow MLP remains the top performer, but fixed-Taylor HWNODE is more parameter-efficient. In this test, a 6.3k-parameter HWNODE solves the task across all seeds, and the scaled HWNODE remains competitive with much larger MLP baselines.
[Results table: per-model mean reward for mlp-narrow, mlp-medium, mlp-large, hwnode-standard, hwnode-scaled, taylor-learned, chebyshev-learned, cheb-ortho-init, cheb-ortho-param, chebyshev-scaled]

The most important result here is that fixed-Taylor HWNODE remains competitive under heavy compression. hwnode-standard solves LunarLander at only 6.3k parameters, while hwnode-scaled approaches the performance of much larger MLPs. At the same time, the best absolute mean reward still belongs to mlp-narrow, so the RL evidence supports HWNODE as parameter-efficient and competitive rather than universally superior.
Parameter Golf: HWNODE vs Parameter-Matched MLP (RX 6800 XT, 10-minute proxy, lower is better)
These tests compare HWNODE against an MLP of roughly similar parameter count. Because sliding-window evaluation was too slow to run consistently on this hardware, the most useful final comparison metric here is final_int6_roundtrip_exact.

[Results table: final_int6_roundtrip_exact for MLP_MULT=1.0; HWNODE_STATE_DIM=384, ORDER=2, VDEPTH=2; and HWNODE_STATE_DIM=384, ORDER=2, VDEPTH=6]

The important pattern is that the MLP baseline wins in full precision, but the simpler HWNODE variants continue to perform well under quantization. The ORDER=2, VDEPTH=2 HWNODE is currently the strongest compressed model, beating the roughly parameter-matched MLP on final_int6_roundtrip_exact despite having worse online validation before export. Increasing virtual depth from 2 to 6 improves the online validation number, but loses some of that gain after quantization, suggesting a tradeoff between representational power and quantization robustness.

These experiments suggest two complementary conclusions. In RL, HWNODE is strongest as a highly parameter-efficient alternative to an MLP, remaining competitive even under severe compression. In parameter golf, the corrected shared-depth HWNODE is not better than MLPs in full precision, but it appears substantially more robust under int6 export. The virtual-depth-6 results after quantization, however, suggest it may be possible to learn a representation more useful than what quantization robustness gains back. Further experimentation is required.
Conclusion
In this PR I introduce a novel method for generating virtual layers at runtime from a single set of parameters. This architecture has been shown to be at least as expressive as MLPs across a limited variety of RL tests, and more parameter-efficient in extremely compressed scenarios. It has shown promise in language modeling and has much potential for future exploration. Some areas to explore include alternative expansion methods to the Taylor series (escaping the 1/n! limitation), a learnable virtual layer count (dynamic thinking), different activation functions (and/or placing them differently), or a learnable Taylor order (learnable k), among other things.